Two of the standards contained in the third draft of
the new Joint Technical Standards for Test Development and Revision
are discussed: Standard 3.13, mandating the use of multicultural
material and the avoidance of material offensive to any major ethnic,
cultural, or gender group; and Standard 3.14, mandating research and
subsequent test revision to eliminate aspects of test design,
content, or format that might serve to bias test scores positively or
negatively for any given group. In evaluating these standards, test
developers must keep in mind the purpose of the test being developed.
They should also realize that these two standards imply that
including multicultural material will ensure that major subgroups
will see material familiar to them and thus score better on the test
(an assumption that is unproven). Research addressing this assumption
must study items with two characteristics: (1) differential
performance has been detected on these items, and (2) the
context of the items can be changed to a context relevant to subgroup
culture without altering the essential task. This is almost
impossible. The fourth draft of the Standards revised these two
standards. Standard 3.13 (now 3.5) was made more general and less
prescriptive, and the two standards were separated. (BW)

Theory and Practice:
The Revised Joint Technical Standards and Test Construction*



Mari Ann Pearlman

*This paper is based on the third draft of the Joint Technical
Standards, the latest draft available at the time of writing.

In the third draft of the new Joint Technical Standards for Test
Development and Revision, two items stand out as of particular
interest and concern to test developers, and it is upon those
Standards that I will focus my remarks. They are Standard 3.13,
which mandates the use of multicultural material and the avoidance
of material offensive to any major ethnic, cultural, or gender
group, and Standard 3.14, which mandates research and subsequent
test revision to eliminate aspects of test design, content, or
format that might serve to bias test scores positively or
negatively for any given group.

No test developer would, I believe, oppose the philosophical
position these standards embody. A test whose content accurately
reflects, to quote the language of Standard 3.13, "the cultural
backgrounds and prior experiences of the major ethnic, cultural,
and gender groups represented in the intended population of test
takers" is likely to be not only broader in its scope but also
much more interesting to develop and take. And a test whose items
do not provide undue differential advantages or disadvantages to
specific groups of test takers should clearly be a desideratum for
all test development.

In fact, perhaps the most important challenge facing
professional test developers in the next 20 years is the
development of measurement instruments that are not biased toward
any test taker. At present, however, I believe that we must be
aware of the technical limitations of the state of our art. We
must not delude ourselves on such an important issue; the modesty
of our accomplishments in this area thus far must be acknowledged,
and the reasons for that modesty explored. Attempting to put
these standards to work in the daily business of writing items and
compiling tests reveals difficulties of interpretation. In
evaluating Standards 3.13 and 3.14 we must keep in mind a very
simple question: What is the purpose of the test we are
developing? It has a very limited, practical function: it is a
device to measure certain narrowly defined skills. To perform
this function adequately, each form of the test must be parallel
with previous forms, and scores on any one form must correlate at
least as well as they have in the past with whatever external
standard is used to measure validity. In the case of national
admissions tests, for example, this standard is the grade point
average of freshmen in college. Therefore, the general framework
within which we consider Standards 3.13 and 3.14 is determined by
awareness of just what flexibility and latitude we have within
these predetermined constraints.



Turning first to Standard 3.13, the inclusion of
multicultural materials, we are confronted immediately with
certain practical problems. In national admissions testing, the
intended population of test takers can include just about anyone.
How are we to define a "major" cultural and/or ethnic group? Is
our responsibility discharged if only protected minorities are
addressed? How much reflection of diversity of background, and
whose diversity, do we put into the test? Because multicultural
material must be inoffensive to everyone, it may be impossible to
consistently reflect the cultural backgrounds and prior
experiences of major cultural and ethnic subgroups in the
population of the United States, or to reflect accurately the
prior experiences of women. Those interactions between the
subgroups and the majority culture that were most significant in
the prior experience of the subgroups are very likely to be in
some way offensive if accurately explored in texts of sufficient
complexity to meet the goals of this Standard.

The Comment appended to Standard 3.13 suggests the
establishment of a review process for test materials to detect and
eliminate material likely to be offensive to groups in the
test-taking population. This can be done, and indeed is done
routinely at Educational Testing Service, where we have developed
a special Test Sensitivity Review procedure that is obligatory for
all tests. The following are some of the standards used by
trained sensitivity reviewers in reviewing test materials. [Copy
of ETS Sensitivity Guidelines handed out to audience]

While the application of these standards certainly eliminates
very obvious and offensive stereotypical language from test
content, an accomplishment not to be lightly dismissed, it does
not even attempt to truly "reflect the cultural background and
prior experiences" of major subgroups. What happens in actual
practice is that the language of all materials is very carefully
scrutinized; women do appear by feminine pronouns in mathematics
items, men are not always the movers, shakers, thinkers, and
authors of all. Clearly, Standard 3.13 intends to engender more
searching efforts than these on the part of test developers. And
here we must return to the general framework I suggested before:
to what end should we seek to include materials that really
"reflect the cultural background and prior experiences" of major
subgroups? Taken seriously, this could mean including materials
that possibly only a small group of test takers could identify
with, thus introducing a new bias into the test. Why does
Standard 3.13 mandate such attempts at "fairness?"



A partial answer to that question is implied, I believe, in
the spaces between Standards 3.13 and 3.14. In a second point
made in the comment appended to Standard 3.13 it is argued that a
review process like the one described above is no substitute for
attention to the different cultural-experiential bases relevant to
test material in the item-construction stage and, before that, in
the test and domain specification stage of test development.

This comment implicitly ties Standard 3.13 to 3.14, which
concerns itself with differential performance on test items. Such
a connection needs to be very carefully scrutinized. It is by no
means clear that the inclusion of multicultural materials will in
and of itself have any effect on the differential performance of
subgroups on specific test items. The inclusion of materials that
attempt to broaden the subject matter base of the test and to
avoid perpetuating stereotypical and biased ways of thinking about
ethnic and cultural subgroups is worth doing in and of itself,
because it is intellectually honest and responsible. But to
suggest that such a procedure bear the burden of reducing or
eliminating differential performance is unrealistic as well as
empirically suspect. It is not clear whether material actually
known to be offensive to major subgroups in the test-taking
population would differentially affect performance, for it is a
hypothesis virtually impossible to test responsibly. One assumes
that such material would prejudice performance, but we do not test
this hypothesis for reasons analogous to those used by laboratory
scientists who do not test the effects of massive doses of
suspected carcinogens on human subjects. Furthermore, there is no
firm information or broad agreement about what specifics of the
cultural backgrounds and prior experiences of various subgroups
affect performance on standardized tests nor how they do whatever
affecting they may do. Beyond very general characteristics that
affect performance on standardized tests, like socioeconomic
status and amount and breadth of schooling, we cannot identify
other differences, if there are any, that create an intellectual
problem-solving style unique to a particular subgroup, a style
that would affect performance on standardized tests.

The implicit assumption in these two Standards is that
including multicultural material would automatically ensure that
major subgroups would see material familiar to them and thus score
better on the test. A moment's thought will convince us that this
is, first, an unproven assumption and, second, an unworkable
suggestion. Many constraints govern the content of any form of a
test: in reading comprehension, the only place in which passages
long enough to deal with these subjects appear, concern for the
differing experiences and interests of students mandates inclusion
of material from a variety of areas, such as natural science,
social science, humanities. Furthermore, any one test form will
include, at most, five or six reading passages distributed among
these areas of interest. Because each test taker sees only one
form of a test, he or she is unlikely to encounter a passage that
reflects his or her cultural background and prior experiences.
Thus, satisfying the standard becomes an aesthetic achievement for
a testing corporation; all the diverse subgroups are represented
over a series of test forms. Such a procedure has virtually no
impact at all on the test takers themselves.

The materials now included in national admissions tests
conform to a certain basic model. This model helps define the set
of important skills, verbal and quantitative, students need to be
successful, that is, to get good grades, in college. Substantial
modification of the content of the materials should be based on
explicit rationales and justifications.

With all these reservations made clear, I should like to
examine the assumptions of Standard 3.14 in light of my experience
on just such a research project as the Standard suggests is
desirable. The first step in such a study is the detection of
item bias, the least difficult part of the research, and itself a
vexed issue. At present test developers and researchers are far
from unanimity on the best statistical model to use for detecting
item bias. And of course different items will be identified as
biased depending upon which statistical model is used. Once some
method has been chosen and biased items are identified, even
greater difficulties and ambiguities arise in an attempt to
formulate hypotheses that might explain the bias: what is it
about these particular items that causes different groups of
test-takers to perform unusually well or unusually poorly in
comparison to their performance on the total section or test? It
comes as no surprise, I'm sure, that generating hypotheses to
explain the relatively poor performances of major cultural and
ethnic subgroups on certain test items is not very difficult. We
think we know, or at least suspect, what characteristics of test
design and item format might produce disadvantages for these
test-takers. It is much more challenging, however, to hypothesize
what it is about a group of items that puts the higher scoring
group, such as men on quantitative items, at a disadvantage. And
hypothesizing about what makes for performance substantially above
the expected level on certain items among subgroups of test takers
is equally problematic. In general, the items look pretty
similar. They have been developed using the same guidelines and
format, the range of difficulty of the items as revealed in
pretesting doesn't explain much about differential performance,
and the subject matter seems to have little bearing in most cases
on performance.

However, even though hypothesis formulation is fraught with
problems, once it is completed the clear-cut parts of such
research are over. For now, items similar to those identified as
biased must be revised in order to test the validity of the
hypothesis. Note that the same items are not typically revised.
Thus, if the bias arose originally because of some quality
peculiar to an item or set and not reproduced in another, all
hypotheses are confounded. Also, the process of revision, at
least in verbal items, introduces so many confounding variables
that interpretations of the results must be very carefully hedged
and limited. Let us say, for example, that one hypothesis is, as
it was in the study in which I participated, that Reading
Comprehension questions which have stems using the LEAST, NOT, or
EXCEPT format, like this one [1], are likely to bias performance
against Black test takers. We hypothesized, for the purposes of
the study, that Black test takers might be at a substantial
disadvantage in performing such a task, which asks not for the one
right response, but for the one anomalous, different, or wrong
response. In revising such an item to test this hypothesis, the
stem was altered to read like this [2]. However, because we are
now asking for the one right response, we changed at least three
options, thus essentially creating a new question. Even if the
results indicate the expected change in performance, I think it
unlikely that we could say with any degree of certainty that such
results substantially increase the probability of our hypothesis.

But what we really want to find out by such research is even
harder to discover. We would like to test the implicit connection
made in these Standards between multicultural materials and
differential performance. To examine the accuracy of this
hypothesis, that a particular subgroup will feel more confident
dealing with and thus do better on a task set in a context
familiar to it, the researcher must select items with the
following characteristics: 1) differential performance has been
detected on these items, and 2) the context of the items can be
changed to a context relevant to subgroup culture and experience
without altering the essential task. This is a tall order.
Context and essence seem to be so interwoven as to be inseparable
in items involving reading and comprehension. In certain obvious
cases content causes item bias. An analogy which uses terms such
as "biretta" like this one [3] clearly favors those test takers
with a Roman Catholic background. Ironically, of course, this
might include a substantial proportion of another major minority
subgroup, like Hispanic Americans. Clearly, too, items like these
[4] favor a small segment of the test taking population and
should be avoided. But the more fundamental questions about
biases built into test designs and item format and content remain
very difficult to get at in an organized empirical fashion.

In the fourth draft of the Standards, which became available
in March after this paper was written, the two standards I have
here discussed have been revised, one of them substantially. I
applaud the revision of the Standard concerning the inclusion of
multicultural materials (originally 3.13, now 3.5), which has been
made much more general and less prescriptive. Also, the two
standards have been separated in the chapter on Test Development,
a wise revision, given the implications of putting them back to
back, which I have discussed. Any return to the specificity of
the Third Draft Standards would be a serious misjudgment of the
task of test developers, in my view.

The author uses which of the

MORTARBOARD:ACADEMIC::

(A) turban:monastic

(B) cap:youthful

(C) wimple:classical

(D) biretta:ecclesiastical

(E) helmet:medieval

LACROSSE:STICK::

(A) boxing:glove

(B) swimming:water

(C) tennis:net

(D) squash:racket

(E) basketball:goal

OPERA:ARIA::

(A) ballet:pirouette

(B) play:soliloquy

(C) portrait:canvas

(D) orchestra:maestro

(E) concert:soloist

Presentation II: Theory and Practice: The Revised Joint Technical
Standards and Test Construction



Mari Pearlman

Educational Testing Service

This presentation will discuss aspects of the revised Joint Technical
Standards as they affect the ongoing process of test construction for
national admissions testing.

Perhaps the most far-reaching assumptions in the revised Joint Technical
Standards from this perspective are those that imply the desiderata of the
training and background of test developers. These tacit assumptions and
their implications for policies and procedures will be discussed. Two
specific parts of the standards, those mandating the consideration of test
material from a multi-cultural, ethnic- and gender-sensitive viewpoint, and
those directing test developers to study differential performance on test
items, will be examined, with specific examples of problems and solutions
in these areas presented. Finally, some social and policy implications of
both the assumptions and the requirements of the revised Joint Technical
Standards will be addressed.